OCR exploration server

Warning: this site is under development!
Warning: this site was generated automatically from raw corpora.
The information has therefore not been validated.

A survey of document image classification: problem statement, classifier architecture and performance evaluation

Internal identifier: 000F46 (Main/Exploration); previous: 000F45; next: 000F47

Authors: Nawei Chen [Canada]; Dorothea Blostein [Canada]

Source: International journal on document analysis and recognition (Print), ISSN 1433-2833, 2007

RBID : Pascal:07-0277587

Abstract

Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging due to the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises due to ill-defined or fuzzy document classes.


The document in XML format

<record>
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">A survey of document image classification: problem statement, classifier architecture and performance evaluation</title>
<author>
<name sortKey="Nawei Chen" sort="Nawei Chen" uniqKey="Nawei Chen" last="Nawei Chen">NAWEI CHEN</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>School of Computing, Queen's University</s1>
<s2>K7L 3N6, Kingston, ON</s2>
<s3>CAN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Canada</country>
<wicri:noRegion>K7L 3N6, Kingston, ON</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Blostein, Dorothea" sort="Blostein, Dorothea" uniqKey="Blostein D" first="Dorothea" last="Blostein">Dorothea Blostein</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>School of Computing, Queen's University</s1>
<s2>K7L 3N6, Kingston, ON</s2>
<s3>CAN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Canada</country>
<wicri:noRegion>K7L 3N6, Kingston, ON</wicri:noRegion>
</affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="wicri:source">INIST</idno>
<idno type="inist">07-0277587</idno>
<date when="2007">2007</date>
<idno type="stanalyst">PASCAL 07-0277587 INIST</idno>
<idno type="RBID">Pascal:07-0277587</idno>
<idno type="wicri:Area/PascalFrancis/Corpus">000346</idno>
<idno type="wicri:Area/PascalFrancis/Curation">000440</idno>
<idno type="wicri:Area/PascalFrancis/Checkpoint">000289</idno>
<idno type="wicri:doubleKey">1433-2833:2007:Nawei Chen:a:survey:of</idno>
<idno type="wicri:Area/Main/Merge">000F60</idno>
<idno type="wicri:Area/Main/Curation">000F46</idno>
<idno type="wicri:Area/Main/Exploration">000F46</idno>
</publicationStmt>
<sourceDesc>
<biblStruct>
<analytic>
<title xml:lang="en" level="a">A survey of document image classification: problem statement, classifier architecture and performance evaluation</title>
<author>
<name sortKey="Nawei Chen" sort="Nawei Chen" uniqKey="Nawei Chen" last="Nawei Chen">NAWEI CHEN</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>School of Computing, Queen's University</s1>
<s2>K7L 3N6, Kingston, ON</s2>
<s3>CAN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Canada</country>
<wicri:noRegion>K7L 3N6, Kingston, ON</wicri:noRegion>
</affiliation>
</author>
<author>
<name sortKey="Blostein, Dorothea" sort="Blostein, Dorothea" uniqKey="Blostein D" first="Dorothea" last="Blostein">Dorothea Blostein</name>
<affiliation wicri:level="1">
<inist:fA14 i1="01">
<s1>School of Computing, Queen's University</s1>
<s2>K7L 3N6, Kingston, ON</s2>
<s3>CAN</s3>
<sZ>1 aut.</sZ>
<sZ>2 aut.</sZ>
</inist:fA14>
<country>Canada</country>
<wicri:noRegion>K7L 3N6, Kingston, ON</wicri:noRegion>
</affiliation>
</author>
</analytic>
<series>
<title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
<imprint>
<date when="2007">2007</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
<seriesStmt>
<title level="j" type="main">International journal on document analysis and recognition : (Print)</title>
<title level="j" type="abbreviated">Int. j. doc. anal. recognit. : (Print)</title>
<idno type="ISSN">1433-2833</idno>
</seriesStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Ambiguity</term>
<term>Artificial intelligence</term>
<term>Character recognition</term>
<term>Data models</term>
<term>Document analysis</term>
<term>Document layout</term>
<term>Document management</term>
<term>Electronic library</term>
<term>Fuzzy logic</term>
<term>High performance</term>
<term>Image analysis</term>
<term>Image classification</term>
<term>Image processing</term>
<term>Learning algorithm</term>
<term>Modeling</term>
<term>Office automation</term>
<term>Optical character recognition</term>
<term>Performance evaluation</term>
<term>Typography</term>
</keywords>
<keywords scheme="Pascal" xml:lang="fr">
<term>Classification image</term>
<term>Traitement image</term>
<term>Bureautique</term>
<term>Bibliothèque électronique</term>
<term>Analyse documentaire</term>
<term>Analyse image</term>
<term>Intelligence artificielle</term>
<term>Reconnaissance caractère</term>
<term>Reconnaissance optique caractère</term>
<term>Haute performance</term>
<term>Evaluation performance</term>
<term>Présentation document</term>
<term>Typographie</term>
<term>Ambiguité</term>
<term>Gestion document</term>
<term>Modèle donnée</term>
<term>Algorithme apprentissage</term>
<term>Logique floue</term>
<term>Modélisation</term>
</keywords>
<keywords scheme="Wicri" type="topic" xml:lang="fr">
<term>Bureautique</term>
<term>Intelligence artificielle</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
<front>
<div type="abstract" xml:lang="en">Document image classification is an important step in Office Automation, Digital Libraries, and other document image analysis applications. There is great diversity in document image classifiers: they differ in the problems they solve, in the use of training data to construct class models, and in the choice of document features and classification algorithms. We survey this diverse literature using three components: the problem statement, the classifier architecture, and performance evaluation. This brings to light important issues in designing a document classifier, including the definition of document classes, the choice of document features and feature representation, and the choice of classification algorithm and learning mechanism. We emphasize techniques that classify single-page typeset document images without using OCR results. Developing a general, adaptable, high-performance classifier is challenging due to the great variety of documents, the diverse criteria used to define document classes, and the ambiguity that arises due to ill-defined or fuzzy document classes.</div>
</front>
</TEI>
<affiliations>
<list>
<country>
<li>Canada</li>
</country>
</list>
<tree>
<country name="Canada">
<noRegion>
<name sortKey="Nawei Chen" sort="Nawei Chen" uniqKey="Nawei Chen" last="Nawei Chen">NAWEI CHEN</name>
</noRegion>
<name sortKey="Blostein, Dorothea" sort="Blostein, Dorothea" uniqKey="Blostein D" first="Dorothea" last="Blostein">Dorothea Blostein</name>
</country>
</tree>
</affiliations>
</record>
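
The record above can be read programmatically. The following is a minimal sketch in Python, not part of Dilib, that pulls the title, authors, RBID and English keywords out of a trimmed copy of the record. It rests on one assumption: the page shows the `wicri:` and `inist:` prefixes without binding them, so the sketch adds placeholder `xmlns` declarations (the `urn:example:...` URIs are hypothetical) to let a strict XML parser accept the text. The function name `summarize` is likewise illustrative.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the record shown above. The two xmlns declarations are an
# assumption added for this sketch: the URIs are placeholders, not real ones.
RECORD = """<record xmlns:wicri="urn:example:wicri" xmlns:inist="urn:example:inist">
<TEI>
<teiHeader>
<fileDesc>
<titleStmt>
<title xml:lang="en" level="a">A survey of document image classification: problem statement, classifier architecture and performance evaluation</title>
<author>
<name sortKey="Nawei Chen" sort="Nawei Chen" uniqKey="Nawei Chen" last="Nawei Chen">NAWEI CHEN</name>
<affiliation wicri:level="1"><country>Canada</country></affiliation>
</author>
<author>
<name sortKey="Blostein, Dorothea" sort="Blostein, Dorothea" uniqKey="Blostein D" first="Dorothea" last="Blostein">Dorothea Blostein</name>
<affiliation wicri:level="1"><country>Canada</country></affiliation>
</author>
</titleStmt>
<publicationStmt>
<idno type="RBID">Pascal:07-0277587</idno>
</publicationStmt>
</fileDesc>
<profileDesc>
<textClass>
<keywords scheme="KwdEn" xml:lang="en">
<term>Document analysis</term>
<term>Image classification</term>
</keywords>
</textClass>
</profileDesc>
</teiHeader>
</TEI>
</record>"""


def summarize(record_xml):
    """Return a small dict of bibliographic fields from one record."""
    root = ET.fromstring(record_xml)
    title_stmt = root.find("./TEI/teiHeader/fileDesc/titleStmt")
    return {
        "title": title_stmt.findtext("title"),
        # Author names are taken as written in the <name> element text.
        "authors": [a.findtext("name") for a in title_stmt.findall("author")],
        "rbid": root.findtext(".//idno[@type='RBID']"),
        "keywords_en": [
            t.text for t in root.findall(".//keywords[@scheme='KwdEn']/term")
        ],
    }


summary = summarize(RECORD)
print(summary["authors"], summary["rbid"])
```

The same paths should apply to the full record once its prefixes are bound, since the sketch only relies on element names and attributes that appear in the XML above.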

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration
HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000F46 | SxmlIndent | more

Or

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000F46 | SxmlIndent | more

To link to this page from within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     Pascal:07-0277587
   |texte=   A survey of document image classification: problem statement, classifier architecture and performance evaluation
}}

Wicri

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024